
Conversation


pujaltes commented May 16, 2024

Like gloo, the ccl backend does not appear to support the ReduceOp.AVG operation (see the example below). To avoid errors when using the avg operation to reduce across devices, I simply extended the checks PL already had in place for gloo.
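For context, the change is essentially just widening the existing backend check in Lightning's reduction helper so that ccl, like gloo, falls back to SUM followed by a division by the world size. A minimal sketch of that idea (not the exact diff; the surrounding variable names are assumptions):

if isinstance(reduce_op, str):
    reduce_op = "avg" if reduce_op == "mean" else reduce_op
    # ccl, like gloo, has no ReduceOp.AVG, so sum and divide by world size instead
    if reduce_op.lower() == "avg" and torch.distributed.get_backend(group) in ("gloo", "ccl"):
        op = ReduceOp.SUM
        divide_by_world_size = True
    else:
        op = getattr(ReduceOp, reduce_op.upper())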

Example of the ccl backend error when running mpiexec -n 4 python mpitest.py:

import torch
from torch.distributed.distributed_c10d import _get_default_group
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the 'ccl' distributed backend)
import os
from lightning.fabric.utilities.types import ReduceOp


os.environ["MASTER_ADDR"] = "pvc-s-162"
os.environ["MASTER_PORT"] = "29502"
os.environ["RANK"] = os.environ.get("PMI_RANK", "0")
os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", "1")
init_method = "env://"
print(f"RANK: {os.environ['RANK']}, WORLD_SIZE: {os.environ['WORLD_SIZE']}", flush=True)

torch.distributed.init_process_group(backend='ccl', init_method=init_method)
test_tensor = torch.rand(1, 1, 40966, dtype=torch.float16, device=f"xpu:{os.environ['RANK']}")
print(f"Device: {test_tensor.device}", flush=True)

# NOTE: The error occurs regardless of how you define the process group
group = torch.distributed.group.WORLD
# group = _get_default_group()

# op = ReduceOp.SUM  # Fine
op = ReduceOp.AVG  # Error
torch.distributed.all_reduce(test_tensor, group=group, async_op=False, op=op)
print("DONE!", flush=True)

📚 Documentation preview 📚: https://pytorch-lightning--8.org.readthedocs.build/en/8/

jingxu10 and others added 2 commits April 28, 2024 05:57
[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

update typos and bug fixes

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

xpu seeding PR1

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

add seeding for pytorch utilities

mp_fabric xpu forking

xpu multiprocess pytorch

add header for xpu

rename

change to lightning.pytorch

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Teardown from lightning-xpu (from PR #3)

From Lightning-AI#3

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

add torch.xpu.stream to ddp

update docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

update _LIGHTNING_XPU_AVAILABLE to _lightning_xpu_available

correct fabric imports.py

1. remove xpu.py from _graveyard
2. correct _lightning_xpu_available() usage

fix _try_import function not defined issue in fabric

add docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix circle import issue

update pytorch trainer connector

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

correct usage in multiprocessing

Fix precision device

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

update warning format
